Problem Statement 1: Exploring the Realtor dataset. This dataset, from Kaggle, contains both categorical/discrete (nominal and ordinal) and numeric (continuous) variables scraped from the www.realtor.com real estate website. The data has over 900K observations (houses) and 12 columns (attributes of each house). The goal is to explore the price variable and find associations between house attributes and price.
# Let's read the data from the dataset
realtor = read.csv("/Users/nancy/Downloads/realtor-1.csv")
realtor
# Exploring the overall structure using the str() and summary() functions:
str(realtor)
'data.frame': 923159 obs. of 12 variables:
$ status : chr "for_sale" "for_sale" "for_sale" "for_sale" ...
$ price : num 105000 80000 67000 145000 65000 179000 50000 71600 100000 300000 ...
$ bed : num 3 4 2 4 6 4 3 3 2 5 ...
$ bath : num 2 2 1 2 2 3 1 2 1 3 ...
$ acre_lot : num 0.12 0.08 0.15 0.1 0.05 0.46 0.2 0.08 0.09 7.46 ...
$ full_address: chr "Sector Yahuecas Titulo # V84, Adjuntas, PR, 00601" "Km 78 9 Carr # 135, Adjuntas, PR, 00601" "556G 556-G 16 St, Juana Diaz, PR, 00795" "R5 Comunidad El Paraso Calle De Oro R-5 Ponce, Ponce, PR, 00731" ...
$ street : chr "Sector Yahuecas Titulo # V84" "Km 78 9 Carr # 135" "556G 556-G 16 St" "R5 Comunidad El Paraso Calle De Oro R-5 Ponce" ...
$ city : chr "Adjuntas" "Adjuntas" "Juana Diaz" "Ponce" ...
$ state : chr "Puerto Rico" "Puerto Rico" "Puerto Rico" "Puerto Rico" ...
$ zip_code : num 601 601 795 731 680 612 639 731 730 670 ...
$ house_size : num 920 1527 748 1800 NA ...
$ sold_date : chr "" "" "" "" ...
summary(realtor)
status price bed bath acre_lot full_address street
Length:923159 Min. : 0 Min. : 1.00 Min. : 1.00 Min. : 0.00 Length:923159 Length:923159
Class :character 1st Qu.: 269000 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 0.11 Class :character Class :character
Mode :character Median : 475000 Median : 3.00 Median : 2.00 Median : 0.29 Mode :character Mode :character
Mean : 884123 Mean : 3.33 Mean : 2.49 Mean : 17.08
3rd Qu.: 839900 3rd Qu.: 4.00 3rd Qu.: 3.00 3rd Qu.: 1.15
Max. :875000000 Max. :123.00 Max. :198.00 Max. :100000.00
NA's :71 NA's :131703 NA's :115192 NA's :273623
city state zip_code house_size sold_date
Length:923159 Length:923159 Min. : 601 Min. : 100 Length:923159
Class :character Class :character 1st Qu.: 2919 1st Qu.: 1130 Class :character
Mode :character Mode :character Median : 7004 Median : 1651 Mode :character
Mean : 6590 Mean : 2142
3rd Qu.:10001 3rd Qu.: 2499
Max. :99999 Max. :1450112
NA's :205 NA's :297843
# The variable types can be deduced with the str() function used above, but we can also use the class() function to check whether each variable is categorical (character) or numerical.
class(realtor$status)
[1] "character"
class(realtor$price)
[1] "numeric"
class(realtor$bed)
[1] "numeric"
class(realtor$bath)
[1] "numeric"
class(realtor$acre_lot)
[1] "numeric"
class(realtor$full_address)
[1] "character"
class(realtor$street)
[1] "character"
class(realtor$city)
[1] "character"
class(realtor$state)
[1] "character"
class(realtor$zip_code)
[1] "numeric"
class(realtor$house_size)
[1] "numeric"
class(realtor$sold_date)
[1] "character"
# To check whether each qualitative variable is nominal or ordinal:
is.ordered(realtor$status)
[1] FALSE
is.ordered(realtor$full_address)
[1] FALSE
is.ordered(realtor$street)
[1] FALSE
is.ordered(realtor$city)
[1] FALSE
is.ordered(realtor$state)
[1] FALSE
is.ordered(realtor$sold_date)
[1] FALSE
# To understand whether the numeric variables are discrete or continuous, we can plot histograms to visually reveal the distribution of values. Discrete variables show distinct separated bars, while continuous variables exhibit a smoother distribution.
hist(realtor$price, main = "Histogram of price", xlab = "Price")
hist(realtor$bed, main = "Histogram of bed", xlab = "Bedrooms")
hist(realtor$bath, main = "Histogram of bath", xlab = "Bathrooms")
hist(realtor$acre_lot, main = "Histogram of acre_lot", xlab = "Lot size (acres)")
hist(realtor$zip_code, main = "Histogram of zip_code", xlab = "Zip code")
hist(realtor$house_size, main = "Histogram of house_size", xlab = "House size (sq ft)")
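As a complementary check to eyeballing the histograms, counting distinct values gives a quick numeric hint of discreteness. This is a heuristic sketch on toy vectors (not part of the original analysis; the threshold of 30 is an arbitrary assumption):

```r
# Heuristic: a numeric column with few distinct values relative to its
# length is likely discrete; many distinct values suggest continuous.
looks_discrete <- function(x, max_unique = 30) {
  length(unique(na.omit(x))) <= max_unique
}

set.seed(42)
beds  <- c(2, 3, 3, 4, 2, 5, 3)            # toy stand-in for realtor$bed
sizes <- round(runif(1000, 500, 3000), 1)  # toy stand-in for realtor$house_size

looks_discrete(beds)   # TRUE: only a handful of distinct counts
looks_discrete(sizes)  # FALSE: effectively continuous
```

On the real columns one would call the same helper on `realtor$bed`, `realtor$house_size`, etc.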
# We try an equidistant-intervals check: for an interval scale, the differences between consecutive values should be approximately equal; if they are not, the variable is not on an interval scale.
# For bed variable:
unique_values <- unique(realtor$bed)
bed_diff <- diff(unique_values)
print(bed_diff)
[1] 1 -2 4 -1 -4 8 NA NA 1 4 1 -3 1 22 -9 4 -14 4 2 -4 -1 4 -2 23 -19 65 -55 -4 15 18 -38 10 67 -50 -20
[36] 1 -7 23 -10 32 55 -98 22
if (all(bed_diff == bed_diff[1], na.rm = TRUE)) {
print("Interval scale.")
} else {
print("No interval scale")
}
[1] "No interval scale"
# For bath variable:
unique_values <- unique(realtor$bath)
bath_diff <- diff(unique_values)
print(bath_diff)
[1] -1 2 2 -1 3 -1 NA NA 1 1 2 1 22 -24 5 -1 3 2 -6 22 -11 -8 2 37 -14 9 -23 170
[29] -176 11 -6 3 -1 -5 22 -25 102 -84 4
if (all(bath_diff == bath_diff[1], na.rm = TRUE)) {
print("Interval scale.")
} else {
print("No interval scale")
}
[1] "No interval scale"
# For zip_code variable:
unique_values <- unique(realtor$zip_code)
zip_diff <- diff(unique_values)
print(zip_diff)
[1] 194 -64 -51 -68 27 91 -60 -8 7 -28 57 30 -43 -75 66 93 -53 -60 110
[20] -49 40 -133 35 -57 1 3 54 -7 -30 59 1 -6 -10 7 13 -70 61 15
... (remaining rows of this console dump omitted for brevity) ...
[ reached getOption("max.print") -- omitted 2191 entries ]
if (all(zip_diff == zip_diff[1], na.rm = TRUE)) {
print("Interval scale.")
} else {
print("No interval scale")
}
[1] "No interval scale"
# Counting and removing duplicate rows with the duplicated() function
duplicate_rows <- realtor[duplicated(realtor), ]
realtor <- realtor[!duplicated(realtor), ]
print(realtor)
print(paste("The number of duplicate rows are:", nrow(duplicate_rows)))
[1] "The number of duplicate rows are: 809370"
summary(realtor)
status price bed bath acre_lot full_address street
Length:113789 Min. : 0 Min. : 1.000 Min. : 1.000 Min. : 0.00 Length:113789 Length:113789
Class :character 1st Qu.: 250000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 0.11 Class :character Class :character
Mode :character Median : 449900 Median : 3.000 Median : 2.000 Median : 0.26 Mode :character Mode :character
Mean : 909606 Mean : 3.309 Mean : 2.521 Mean : 17.74
3rd Qu.: 800000 3rd Qu.: 4.000 3rd Qu.: 3.000 3rd Qu.: 1.03
Max. :875000000 Max. :123.000 Max. :198.000 Max. :100000.00
NA's :18 NA's :17516 NA's :16297 NA's :31123
city state zip_code house_size sold_date
Length:113789 Length:113789 Min. : 601 Min. : 100 Length:113789
Class :character Class :character 1st Qu.: 6010 1st Qu.: 1152 Class :character
Mode :character Mode :character Median : 8005 Median : 1664 Mode :character
Mean : 8267 Mean : 2163
3rd Qu.:10301 3rd Qu.: 2499
Max. :99999 Max. :1450112
NA's :33 NA's :36448
# Keeping only houses priced above $50,000
realtor <- realtor[realtor$price > 50000, ]
print(realtor)
# Let's calculate the IQR and the 1.5*IQR outlier fences. As price has missing values, we use na.rm = TRUE
iqr <- IQR(realtor$price, na.rm = TRUE)
q1 <- quantile(realtor$price, 0.25, na.rm = TRUE) - 1.5 * iqr
q3 <- quantile(realtor$price, 0.75, na.rm = TRUE) + 1.5 * iqr
print(iqr)
[1] 560000
realtor <- realtor[!(realtor$price < q1 | realtor$price > q3), ]
print(realtor)
hist(realtor$price, main = "The histogram for the price variable", xlab = "Price", ylab = "Frequency")
# The corresponding boxplot:
boxplot(realtor$price, main = "The boxplot for the price variable", ylab = "Price")
missing_observation <- mean(is.na(realtor$price)) * 100
print(missing_observation)
[1] 0.01821789
# Let's take the date format as Year-Month-date
realtor$sold_date <- as.Date(realtor$sold_date, format = "%Y-%m-%d")
print(realtor)
# Using the format() function to extract the year and month from the sold_date attribute, creating two new attributes that store them in numeric form
realtor$sold_year <- as.numeric(format(realtor$sold_date, "%Y"))
realtor$sold_month <- as.numeric(format(realtor$sold_date, "%m"))
print(realtor)
realtor$state <- factor(realtor$state)
print(realtor)
summary(realtor$state)
Connecticut Delaware Georgia Maine Massachusetts New Hampshire New Jersey New York Pennsylvania
12674 1262 5 4012 8673 3234 30363 21682 8513
Puerto Rico Rhode Island Vermont Virgin Islands Virginia West Virginia Wyoming NA's
2298 3249 2206 606 7 1 1 18
# To check the number of observations for the West Virginia state
value <- realtor[realtor$state == "West Virginia", ]
print(value)
realtor <- realtor[realtor$state != "West Virginia", ]
realtor <- realtor[realtor$state != "Wyoming", ]
print(realtor)
anova_res <- aov(price ~ state, data = realtor)
summary(anova_res)
                Df    Sum Sq   Mean Sq F value Pr(>F)
state           13 1.466e+15 1.128e+14    1064 <2e-16 ***
Residuals    98770 1.047e+16 1.060e+11
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
18 observations deleted due to missingness
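The significant ANOVA only says that at least one state's mean price differs from the others. As a possible follow-up (not in the original analysis), Tukey's HSD locates the differing pairs; a self-contained sketch on synthetic data, standing in for the realtor columns:

```r
# Toy data: three "states" with different mean prices
set.seed(1)
toy <- data.frame(
  price = c(rnorm(30, mean = 200), rnorm(30, mean = 250), rnorm(30, mean = 205)),
  state = rep(c("A", "B", "C"), each = 30)
)

fit <- aov(price ~ state, data = toy)
# Pairwise mean differences with family-wise adjusted p-values
TukeyHSD(fit)
```

On the real data the same call would be `TukeyHSD(anova_res)`, though with 14 states the table of pairwise comparisons gets large.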
12.(1pt) What is the correlation between house_price and the variables sold_year, house_size, bed, and bath? Note: The “cor” function returns error when NAs are present in the variables. Set use=“pairwise.complete.obs” inside the “cor” function to ignore NAs when computing correlation coefficient between a pair of variables
correlation <- cor(realtor[, c("price", "sold_year", "house_size", "bed", "bath")], use = "pairwise.complete.obs")
print(correlation)
price sold_year house_size bed bath
price 1.000000000 -0.001095265 0.17842388 0.20480875 0.4169943
sold_year -0.001095265 1.000000000 -0.03343316 -0.07504495 -0.0461646
house_size 0.178423878 -0.033433163 1.00000000 0.34031245 0.3486002
bed 0.204808748 -0.075044948 0.34031245 1.00000000 0.6441460
bath 0.416994304 -0.046164604 0.34860019 0.64414601 1.0000000
Problem 2 — Exploring the Heart Disease Dataset. In this problem, you are going to explore the heart disease dataset from UCI. This dataset contains 76 attributes, but only 14 of them are relevant and used in publications. These 14 attributes have already been processed and extracted from the dataset. Click on Data Folder and download the four processed datasets: processed.cleveland.data, processed.hungarian.data, processed.switzerland.data, processed.va.data.
cleveland = read.csv("/Users/nancy/Downloads/processed.cleveland.data", na.strings = "?", header = FALSE )
cleveland
hungarian = read.csv("/Users/nancy/Downloads/processed.hungarian.data",na.strings = "?", header = FALSE )
hungarian
switzerland = read.csv("/Users/nancy/Downloads/processed.switzerland.data", na.strings = "?", header = FALSE)
switzerland
va = read.csv("/Users/nancy/Downloads/processed.va.data",na.strings = "?", header = FALSE)
va
# Let's first combine these four data frames into one
heart_disease <- rbind(cleveland, hungarian, switzerland, va)
# Now, manually setting the 14 columns according to the document provided:
colnames(heart_disease) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
print(heart_disease)
# Exploring the overall structure of the data set using summary function
summary(heart_disease)
age sex cp trestbps chol fbs restecg thalach
Min. :28.00 Min. :0.0000 Min. :1.00 Min. : 0.0 Min. : 0.0 Min. :0.0000 Min. :0.0000 Min. : 60.0
1st Qu.:47.00 1st Qu.:1.0000 1st Qu.:3.00 1st Qu.:120.0 1st Qu.:175.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
Median :54.00 Median :1.0000 Median :4.00 Median :130.0 Median :223.0 Median :0.0000 Median :0.0000 Median :140.0
Mean :53.51 Mean :0.7891 Mean :3.25 Mean :132.1 Mean :199.1 Mean :0.1663 Mean :0.6046 Mean :137.5
3rd Qu.:60.00 3rd Qu.:1.0000 3rd Qu.:4.00 3rd Qu.:140.0 3rd Qu.:268.0 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:157.0
Max. :77.00 Max. :1.0000 Max. :4.00 Max. :200.0 Max. :603.0 Max. :1.0000 Max. :2.0000 Max. :202.0
NA's :59 NA's :30 NA's :90 NA's :2 NA's :55
exang oldpeak slope ca thal num
Min. :0.0000 Min. :-2.6000 Min. :1.000 Min. :0.0000 Min. :3.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:0.0000
Median :0.0000 Median : 0.5000 Median :2.000 Median :0.0000 Median :6.000 Median :1.0000
Mean :0.3896 Mean : 0.8788 Mean :1.771 Mean :0.6764 Mean :5.088 Mean :0.9957
3rd Qu.:1.0000 3rd Qu.: 1.5000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:7.000 3rd Qu.:2.0000
Max. :1.0000 Max. : 6.2000 Max. :3.000 Max. :3.0000 Max. :7.000 Max. :4.0000
NA's :55 NA's :62 NA's :309 NA's :611 NA's :486
str(heart_disease)
'data.frame': 920 obs. of 14 variables:
$ age : num 63 67 67 37 41 56 62 57 63 53 ...
$ sex : num 1 1 1 1 0 1 0 0 1 1 ...
$ cp : num 1 4 4 3 2 2 4 4 4 4 ...
$ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
$ chol : num 233 286 229 250 204 236 268 354 254 203 ...
$ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
$ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
$ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
$ exang : num 0 1 1 0 0 0 0 1 0 1 ...
$ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
$ slope : num 3 2 2 3 1 1 3 1 2 3 ...
$ ca : num 0 3 2 0 0 0 2 0 1 0 ...
$ thal : num 6 3 7 3 3 3 3 3 7 7 ...
$ num : int 0 2 1 0 0 0 3 0 2 1 ...
# Finding the number of rows that have missing values in one or more attributes:
missing_rows <- sum(rowSums(is.na(heart_disease)) > 0)
print(missing_rows)
[1] 621
# The percentage would be:
missing_rows_perc <- (missing_rows / nrow(heart_disease)) * 100
print(missing_rows_perc)
[1] 67.5
So, after going through the documentation, the observations made are: 7 of the 14 attributes are categorical: 1. sex 2. cp 3. fbs, which is fasting blood sugar, dichotomized as above 120 mg/dl or not, creating two classes 4. restecg 5. exang 6. slope 7. thal. The other 7 attributes are numerical.
Regarding whether the qualitative variables are nominal or ordinal: after going through the attribute descriptions, the nominal variables are: 1. sex 2. cp, since its four values (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic) have no particular order 3. fbs, having two values (above 120 mg/dl or not) 4. exang.
The ordinal variables are: 1. restecg, whose three values follow a particular order 2. slope 3. thal, whose numeric codes imply an order.
Regarding whether the numeric variables are discrete or continuous: the discrete variables are: 1. age 2. ca 3. num, which per the description is effectively categorical, coding absence (0) versus presence (1-4) of disease.
The continuous variables are: 1. trestbps 2. chol 3. thalach 4. oldpeak.
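The nominal/ordinal distinction above can be made explicit in R: plain `factor()` for nominal variables, `ordered = TRUE` for ordinal ones. A toy sketch on stand-in vectors (value codings follow the UCI attribute description; the vectors themselves are made up):

```r
# Toy vectors standing in for dataset columns
sex_raw   <- c(1, 1, 0, 1, 0)
slope_raw <- c(3, 2, 2, 3, 1)

# Nominal: a plain factor carries no ordering between levels
sex_f <- factor(sex_raw, levels = c(0, 1), labels = c("F", "M"))

# Ordinal: ordered = TRUE preserves the 1 < 2 < 3 ranking of slope codes
slope_f <- factor(slope_raw, levels = c(1, 2, 3),
                  labels = c("Upsloping", "Flat", "Downsloping"),
                  ordered = TRUE)

is.ordered(sex_f)    # FALSE
is.ordered(slope_f)  # TRUE
```

Encoding the ordinal variables this way would also make the earlier `is.ordered()` checks meaningful instead of always returning FALSE.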
print(heart_disease)
# using the factor function:
unique(heart_disease$restecg)
heart_disease$sex <- factor(heart_disease$sex, labels = c("F", "M"))
heart_disease$cp <- factor(heart_disease$cp, labels = c("Typical Angina", "Atypical Angina", "Non-anginal Pain", "Asymptomatic"))
heart_disease$fbs <- factor(heart_disease$fbs, labels = c("False", "True"))
heart_disease$exang <- factor(heart_disease$exang, labels = c("No", "Yes"))
heart_disease$slope <- factor(heart_disease$slope, labels = c("Upsloping", "Flat", "Downsloping"))
heart_disease$thal <- factor(heart_disease$thal, labels = c("Normal", "Fixed Defect", "Reversable Defect"))
print(heart_disease)
# R has no built-in statistical mode; take the most frequent value from a frequency table
mode_age <- as.numeric(names(which.max(table(heart_disease$age))))
median_age <- median(heart_disease$age)
print(paste("The mode is", mode_age))
print(paste("The median is", median_age))
[1] "The median is 54"
# column 14 is num
heart_disease$diagnosis <- factor(ifelse(heart_disease$num == 0, "No", "Yes"))
print(heart_disease)
# Replacing this column with new diagnosis column
heart_disease$num <- NULL
print(heart_disease)
# Finding the relationship of "diagnosis" variable with firstly the numeric variables:
# We'll use box plots and t-test
# The numeric variables are: age, trestbps, chol, thalach, oldpeak, ca
boxplot(heart_disease$age ~ heart_disease$diagnosis, xlab = "Diagnosis", ylab = "Age")
boxplot(heart_disease$trestbps ~ heart_disease$diagnosis, xlab = "Diagnosis", ylab = "Trestbps")
boxplot(heart_disease$chol ~ heart_disease$diagnosis, xlab = "Diagnosis", ylab = "Chol")
boxplot(heart_disease$thalach ~ heart_disease$diagnosis, xlab = "Diagnosis", ylab = "Thalach")
boxplot(heart_disease$oldpeak~ heart_disease$diagnosis, xlab = "Diagnosis", ylab = "Oldpeak")
boxplot(heart_disease$ca ~ heart_disease$diagnosis, xlab = "Diagnosis", ylab = "Ca")
# Performing a t-test now for the continuous variables only, which are: trestbps, chol, thalach, oldpeak
trestbps_t_test <- t.test(trestbps ~ diagnosis, data = heart_disease)
chol_t_test <- t.test(chol~ diagnosis, data = heart_disease)
thalach_t_test <- t.test(thalach ~ diagnosis, data = heart_disease)
oldpeak_t_test <- t.test(oldpeak ~ diagnosis, data = heart_disease)
print(trestbps_t_test)
Welch Two Sample t-test
data: trestbps by diagnosis
t = -3.1878, df = 858.85, p-value = 0.001485
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-6.56889 -1.56247
sample estimates:
mean in group No mean in group Yes
129.9130 133.9787
print(chol_t_test)
Welch Two Sample t-test
data: chol by diagnosis
t = 7.4756, df = 830.75, p-value = 1.951e-13
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
37.92323 64.92815
sample estimates:
mean in group No mean in group Yes
227.9056 176.4799
print(thalach_t_test)
Welch Two Sample t-test
data: thalach by diagnosis
t = 12.633, df = 837.18, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
17.34784 23.72998
sample estimates:
mean in group No mean in group Yes
148.8005 128.2616
print(oldpeak_t_test)
Welch Two Sample t-test
data: oldpeak by diagnosis
t = -12.763, df = 780.9, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-0.9742705 -0.7145329
sample estimates:
mean in group No mean in group Yes
0.4182051 1.2626068
# We'll perform the Kruskal-Wallis test for the ordinal variables: restecg, slope, thal
restecg_kt <- kruskal.test(restecg ~ diagnosis, data = heart_disease)
slope_kt <- kruskal.test(slope ~ diagnosis, data = heart_disease)
thal_kt <- kruskal.test(thal ~ diagnosis, data = heart_disease)
print(restecg_kt)
Kruskal-Wallis rank sum test
data: restecg by diagnosis
Kruskal-Wallis chi-squared = 5.4566, df = 1, p-value = 0.01949
print(slope_kt)
Kruskal-Wallis rank sum test
data: slope by diagnosis
Kruskal-Wallis chi-squared = 76.163, df = 1, p-value < 2.2e-16
print(thal_kt)
Kruskal-Wallis rank sum test
data: thal by diagnosis
Kruskal-Wallis chi-squared = 101.43, df = 1, p-value < 2.2e-16
Again, the p-values are less than 0.05. Therefore, we reject the null hypothesis and conclude that there is an association between each of these ordinal variables and the diagnosis.
# Now, we'll use mosaic plots and the chi-square test for the categorical variables
# The categorical variables are: sex, cp, fbs, restecg, exang, slope, thal
# Constructing mosaic plots
mosaicplot(table(heart_disease$sex, heart_disease$diagnosis), main = "Mosaic Plot", color = c("lightblue", "pink"))
mosaicplot(table(heart_disease$cp, heart_disease$diagnosis), main = "Mosaic Plot", color = c("lightblue", "pink"))
mosaicplot(table(heart_disease$fbs, heart_disease$diagnosis), main = "Mosaic Plot", color = c("lightblue", "pink"))
mosaicplot(table(heart_disease$restecg, heart_disease$diagnosis), main = "Mosaic Plot", color = c("lightblue", "pink"))
mosaicplot(table(heart_disease$exang, heart_disease$diagnosis), main = "Mosaic Plot", color = c("lightblue", "pink"))
mosaicplot(table(heart_disease$slope, heart_disease$diagnosis), main = "Mosaic Plot", color = c("lightblue", "pink"))
mosaicplot(table(heart_disease$thal, heart_disease$diagnosis), main = "Mosaic Plot", color = c("lightblue", "pink"))
# Now, we'll use chi-square test
sex_chisq <- chisq.test(table(heart_disease$sex, heart_disease$diagnosis))
cp_chisq <- chisq.test(table(heart_disease$cp, heart_disease$diagnosis))
fbs_chisq <- chisq.test(table(heart_disease$fbs, heart_disease$diagnosis))
restecg_chisq <- chisq.test(table(heart_disease$restecg, heart_disease$diagnosis))
exang_chisq <- chisq.test(table(heart_disease$exang, heart_disease$diagnosis))
slope_chisq <- chisq.test(table(heart_disease$slope, heart_disease$diagnosis))
thal_chisq <- chisq.test(table(heart_disease$thal, heart_disease$diagnosis))
print(sex_chisq)
Pearson's Chi-squared test with Yates' continuity correction
data: table(heart_disease$sex, heart_disease$diagnosis)
X-squared = 85.361, df = 1, p-value < 2.2e-16
print(cp_chisq)
Pearson's Chi-squared test
data: table(heart_disease$cp, heart_disease$diagnosis)
X-squared = 268.35, df = 3, p-value < 2.2e-16
print(fbs_chisq)
Pearson's Chi-squared test with Yates' continuity correction
data: table(heart_disease$fbs, heart_disease$diagnosis)
X-squared = 16.112, df = 1, p-value = 5.972e-05
print(restecg_chisq)
Pearson's Chi-squared test
data: table(heart_disease$restecg, heart_disease$diagnosis)
X-squared = 11.712, df = 2, p-value = 0.002863
print(exang_chisq)
Pearson's Chi-squared test with Yates' continuity correction
data: table(heart_disease$exang, heart_disease$diagnosis)
X-squared = 184.02, df = 1, p-value < 2.2e-16
print(slope_chisq)
Pearson's Chi-squared test
data: table(heart_disease$slope, heart_disease$diagnosis)
X-squared = 88.852, df = 2, p-value < 2.2e-16
print(thal_chisq )
Pearson's Chi-squared test
data: table(heart_disease$thal, heart_disease$diagnosis)
X-squared = 109.05, df = 2, p-value < 2.2e-16
According to these p-values, we reject the null hypothesis of independence and conclude that each of these categorical variables is significantly associated with the heart-disease diagnosis.
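Since seven chi-square tests were run side by side, a multiple-comparison correction (not part of the original assignment) is a cheap robustness check on that conclusion. Using the p-values reported above:

```r
# p-values from the seven chi-square tests above (2.2e-16 is R's print floor)
p_raw <- c(sex = 2.2e-16, cp = 2.2e-16, fbs = 5.972e-05, restecg = 0.002863,
           exang = 2.2e-16, slope = 2.2e-16, thal = 2.2e-16)

# Bonferroni is the most conservative family-wise correction:
# each p-value is multiplied by the number of tests (capped at 1)
p_adj <- p.adjust(p_raw, method = "bonferroni")
all(p_adj < 0.05)  # TRUE: every association survives the correction
```

Even the largest adjusted p-value (restecg, about 0.02) stays below 0.05, so the conclusion is robust to the number of tests performed.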